Abstract

We propose a multiple imputation method to deal with incomplete categorical data. 

This method imputes the missing entries using the principal components method
dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty
concerning the parameters of the imputation model is reflected using a non-parametric
bootstrap. Multiple imputation using MCA (MIMCA) requires estimating a small
number of parameters due to the dimensionality reduction property of MCA. It allows
the user to impute a large range of data sets. In particular, a high number of
categories per variable, a high number of variables or a small number of individuals
are not an issue for MIMCA. Through a simulation study based on real data sets, the
method is assessed and compared to the reference methods (multiple imputation using
the loglinear model, multiple imputation by logistic regressions) as well as to the latest
works on the topic (multiple imputation by random forests or by the Dirichlet process
mixture of products of multinomial distributions model). The proposed method shows
good performance in terms of bias and coverage for an analysis model such as a main
effects logistic regression model. In addition, MIMCA has the great advantage that
it is substantially less time consuming on data sets of high dimensions than the other
multiple imputation methods.

Keywords: missing values, categorical data, multiple imputation, multiple correspondence
analysis, bootstrap

1 Introduction 

Data sets with categorical variables are ubiquitous in many fields, such as the social sciences,
where surveys are conducted through multiple-choice questions. Whatever the field, missing
values frequently occur and are a key problem in statistical practice since most statistical
methods cannot be applied directly to incomplete data.

To deal with missing values one solution consists in adapting the statistical method so 
that it can be applied on an incomplete data set. For instance, the maximum likelihood (ML) 
estimators can be derived from incomplete data using an Expectation-Maximization (EM) 
algorithm [1] and their standard error can be estimated using a Supplemented Expectation- 
Maximization algorithm [2]. The ML approach is suitable, but not always easy to establish 
0 . 

Another way consists in replacing missing values by plausible values according to an
imputation model. This is called single imputation. Thus, the data set is complete and
any statistical method can be applied to it. Figure 1 illustrates three simple single
imputation methods. The data set used contains 1000 individuals and two variables with
two categories: A and B for the first variable, a and b for the second one. The data set
is built so that 40% of the individuals take the category a and 60% the category b. In
addition, the variables are linked, that is to say, the probability to observe a or b on the
second variable depends on the category taken on the first variable. Then, 30% of missing
values are generated completely at random on the second variable. A first method could be to
impute according to the most taken category of the variable. In this case, all missing values
are imputed by b. Consequently, marginal proportions are modified, as well as conditional
proportions (see the bottom part of Figure 1). This method is clearly not suitable. A
more convenient solution consists in taking into account the relationship between the two
variables, following the rationale of the imputation by regression for continuous data. To
achieve this goal, the parameters of a logistic regression are estimated from the complete
cases, providing fitted conditional proportions. Then, each individual is imputed according
to the highest conditional proportion given the first variable. This method respects the
conditional proportions better, but the relationship between variables is strengthened, which
is not satisfactory. In order to obtain an imputed data set with a structure as close as possible
to the generated data set, a suitable single imputation method is to perform stochastic
regression: instead of imputing according to the most likely category, the imputation is
performed randomly according to the fitted probabilities.

Figure 1: Illustration of three imputation methods for two categorical variables: the top part
describes how the data are built (marginal and conditional proportions, associated
complete data, incomplete data set generated where NA denotes a missing value) and the
bottom part sums up the observed conditional proportions after an imputation by
several methods (majority, regression, stochastic regression). The last line indicates the
coverage for the confidence interval for the proportion of b over 1000 simulations.
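The three single imputation strategies described above can be sketched as follows. This is only an illustration under stated assumptions: the conditional proportions P(b | A) = 0.8 and P(b | B) = 0.4 are hypothetical values chosen so that 60% of the individuals take the category b (they are not read from Figure 1), and the function name is ours. With a single binary predictor, the fitted probabilities of a logistic regression coincide with the observed conditional proportions, so the sketch computes those directly instead of fitting a model.

```python
import numpy as np

def impute_three_ways(n=1000, miss_rate=0.3, seed=0):
    """Simulate two linked binary variables and impute the second one
    by majority, regression and stochastic regression."""
    rng = np.random.default_rng(seed)
    # Hypothetical conditional proportions P(second = b | first).
    p_b_given = {"A": 0.8, "B": 0.4}
    first = rng.choice(np.array(["A", "B"]), size=n)
    p_b = np.where(first == "A", p_b_given["A"], p_b_given["B"])
    second = np.where(rng.random(n) < p_b, "b", "a")
    miss = rng.random(n) < miss_rate          # missing completely at random
    obs = ~miss

    # Fitted P(b | first), estimated from the complete cases only.
    fitted = {c: np.mean(second[obs & (first == c)] == "b") for c in ("A", "B")}
    fitted_row = np.where(first == "A", fitted["A"], fitted["B"])
    majority_cat = "b" if np.mean(second[obs] == "b") >= 0.5 else "a"

    majority = second.copy()
    majority[miss] = majority_cat
    regression = second.copy()
    regression[miss] = np.where(fitted_row[miss] >= 0.5, "b", "a")
    stochastic = second.copy()
    stochastic[miss] = np.where(rng.random(miss.sum()) < fitted_row[miss], "b", "a")
    return first, miss, {"majority": majority,
                         "regression": regression,
                         "stochastic": stochastic}

first, miss, imputed = impute_three_ways()
for name, x in imputed.items():
    print(name, "P(b | A) =", round(float(np.mean(x[first == "A"] == "b")), 3))
```

Comparing the printed conditional proportions shows the pattern discussed in the text: the regression imputation pushes P(b | A) above its generating value, whereas the stochastic version stays close to it.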

An imputation model used to perform single imputation has to be sufficiently complex
compared to the statistical method desired (the analysis model). For instance, if the aim is
to apply a logistic regression from an incomplete data set, it requires using an imputation
model taking into account the relationships between variables. Thus, a suitable single
imputation method, such as the stochastic regression strategy, leads to unbiased estimates of the
parameters of the statistical method (see Figure 1). However, although the single imputation
method respects the structure of the data, it still has the drawback that the uncertainty on
the imputed values is not taken into account in the estimate of the variability of the
estimators. Thus, this variability remains underestimated. For instance, in Figure 1, the
coverage of the confidence interval for $\pi_b$, the proportion of b, is 89.9% and does not
reach the nominal rate of 95%.

Multiple imputation (MI) [4, 5] has been developed to avoid this issue. The principle of
multiple imputation consists in creating M imputed data sets to reflect the uncertainty on
imputed values. Then, the parameters of the statistical method, denoted $\psi$, are estimated
from each imputed data set, leading to M sets of parameters $(\hat{\psi}_m)_{1 \le m \le M}$.
Lastly, these sets of parameters are pooled to provide a unique estimation for $\psi$ and for
its associated variability using Rubin's rules [Ij.
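For a scalar parameter, the pooling step can be sketched as below. The function name and arguments are ours; the formulas are the standard form of Rubin's rules: the pooled estimate is the mean of the M estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()                  # pooled point estimate
    within = variances.mean()                  # within-imputation variance
    between = estimates.var(ddof=1)            # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Example with M = 3 imputed data sets.
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what a single imputation cannot provide, which is why its confidence intervals are too narrow.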

MI is based on the ignorability assumption, that is to say ignoring the mechanism that
generated missing values. This assumption is equivalent to two conditions: first, the parameters
that govern the missing data mechanism and the parameters of the analysis model are independent;
second, missing values are generated at random, that is to say, the probability that a missing
value occurs on a cell is independent of the value of the cell itself. In practice, ignorability
and missing at random (MAR) are used interchangeably. This assumption is more
plausible when the number of variables is high ilE], but remains difficult to verify.

Thus, under the ignorability assumption, the main challenge in multiple imputation is
to reflect the uncertainty of the imputed values by reflecting properly [H p. 118-128] the
uncertainty on the parameters of the model used to perform imputation, in order to get imputed
data sets yielding valid statistical inferences. To do so, two classical approaches can be
considered. The first one is the Bayesian approach: a prior distribution is assumed on the
parameters $\theta$ of the imputation model; it is combined with the observed entries, providing
a posterior distribution from which M sets of parameters $(\theta_m)_{1 \le m \le M}$ are drawn.
Then, the incomplete data set is imputed M times using each set of parameters. The second one
is a bootstrap approach: M samples with replacement are drawn, leading to M incomplete
data sets from which the parameters of the imputation model are obtained. The M sets of
parameters $(\theta_m)_{1 \le m \le M}$ are then used to perform M imputations of the original
incomplete data set.
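The bootstrap approach can be illustrated on a deliberately small case: a single categorical variable whose imputation model is just its vector of category proportions. This toy model, the coding of missing entries as -1 and the function name are assumptions made for the sketch; they are not the imputation model studied in this paper.

```python
import numpy as np

def bootstrap_mi(x, M, rng):
    """x: 1-D array of category codes, -1 marks a missing entry.
    Returns M imputed copies of x (assumes each bootstrap replicate
    keeps at least one observed value)."""
    x = np.asarray(x)
    n = len(x)
    cats = np.unique(x[x >= 0])
    miss = x < 0
    imputed = []
    for _ in range(M):
        boot = x[rng.integers(n, size=n)]        # incomplete bootstrap replicate
        obs = boot[boot >= 0]
        # Parameters of the imputation model estimated on the replicate.
        theta_m = np.array([(obs == c).mean() for c in cats])
        # Imputation of the ORIGINAL incomplete data with theta_m.
        xm = x.copy()
        xm[miss] = rng.choice(cats, size=miss.sum(), p=theta_m)
        imputed.append(xm)
    return imputed

rng = np.random.default_rng(0)
x = np.array([0, 1, 1, 2, -1, 0, -1, 1, 2, 1])
for xm in bootstrap_mi(x, M=3, rng=rng):
    print(xm)
```

Because each replicate gives slightly different parameters, the M imputed data sets differ by more than the sampling noise of a single fitted model.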

In this paper, we detail in Section 2 the main available MI methods to deal with categorical
data. Two general modelling strategies can be distinguished for imputing multivariate data:
joint modelling (JM) [6] and fully conditional specification (FCS) [8]. JM is based on the
assumption that the data can be described by a multivariate distribution. Concerning FCS,
the multivariate distribution is not defined explicitly, but implicitly through the conditional
distributions of each variable only. Among the presented methods, three are JM methods:
MI using the loglinear model, MI using the latent class model and MI using the normal
distribution; the two others are FCS strategies: FCS using logistic regressions and FCS
using random forests [9]. In Section 3, a novel JM method based on a principal components
method dedicated to categorical data, namely multiple correspondence analysis (MCA), is
proposed. Principal components methods are commonly used to highlight the similarities
between individuals and the relationships between variables, using a small number of principal
components and loadings. MI based on this family of methods uses these similarities and
these relationships to perform imputation, while using a restricted number of parameters.
The performance of the imputation is very promising for continuous data [10, 11], which
motivates the consideration of a method for categorical data. In Section 4, a simulation study
based on real data sets evaluates the novel method and compares its performance to other
main multiple imputation methods. Lastly, conclusions about MI for categorical data and
possible extensions for the novel method are detailed.

2 Multiple imputation methods for categorical data 

The imputation of categorical variables is rather complex. Indeed, contrary to continuous
data, the variables follow a distribution on a discrete support defined by the combinations
of categories observed for each individual. Because of the explosion of the number of
combinations when the number of categories increases, the number of parameters defining the
multivariate distribution could be extremely large. Consequently, defining an imputation
model is not straightforward for categorical data. In this section we review the most popular
approaches commonly used to deal with categorical data: JM using the loglinear model, JM
using the latent class model, JM using the normal distribution and FCS using multinomial
logistic regression or random forests.

Hereinafter, matrices and vectors will be in bold text, whereas sets of random variables or
single random variables will not. Matrices will be in capital letters, whereas vectors will be in
lower case letters. We denote $\mathbf{X}_{I \times K}$ a data set with $I$ individuals and $K$
variables. We note the observed part of $\mathbf{X}$ by $\mathbf{X}_{obs}$ and the missing part
by $\mathbf{X}_{miss}$, so that $\mathbf{X} = (\mathbf{X}_{obs}, \mathbf{X}_{miss})$.
Let $q_k$ denote the number of categories for the variable $X_k$, and $J = \sum_{k=1}^{K} q_k$
the total number of categories. We note $\mathbb{P}(X; \theta)$ the distribution of the variables
$X = (X_1, \ldots, X_K)$, where $\theta$ is the corresponding set of parameters.

2.1 Multiple imputation using a loglinear model 

The saturated loglinear model (or multinomial model) [12] consists in assuming a multinomial
distribution $\mathcal{M}(\theta, 1)$ as joint distribution for $X$, where
$\theta = (\theta_{x_1 \ldots x_K})_{x_1, \ldots, x_K}$ is a vector
indicating the probability to observe each event $(X_1 = x_1, \ldots, X_K = x_K)$. Performing MI
with the loglinear model [H] is often achieved by reflecting the variability of the imputation
model's parameters with a Bayesian approach. More precisely, a Bayesian treatment of this
model can be specified as follows:

$$X \mid \theta \sim \mathcal{M}(\theta, 1) \tag{1}$$
$$\theta \sim \mathcal{D}(\alpha) \tag{2}$$
$$\theta \mid X \sim \mathcal{D}(\alpha + \theta^{ML}) \tag{3}$$

where $\mathcal{D}(\alpha)$ denotes the Dirichlet distribution with parameter $\alpha$, a vector
with the same dimension as $\theta$, and $\theta^{ML}$ is the maximum likelihood estimate of
$\theta$, corresponding to the observed proportions of each combination in the data set. A
classical choice for $\alpha$ is $\alpha = (1/2, \ldots, 1/2)$,
corresponding to the non-informative Jeffreys prior [13]. Combining the prior distribution
and the observed entries, a posterior distribution for the model's parameters is obtained
(Equation (3)).

Because missing values occur in the data set, the posterior distribution is not tractable;
therefore, drawing a set of model parameters from it is not straightforward. Thus, a
data-augmentation algorithm [14] is used. In the first step of the algorithm, missing values are
imputed by random values. Then, because the data set is now completed, a draw of $\theta$
from the posterior distribution (3) can easily be obtained. Next, missing values are imputed
from the predictive distribution (1) using the previously drawn parameter and the observed
values. These steps of imputation and draw from the posterior distribution are repeated
until convergence. At the end, one set of parameters $\theta_m$, drawn from the observed posterior
distribution, is obtained. Repeating the procedure M times in parallel, M sets of parameters
are obtained from which multiple imputation can be done. In this way, the uncertainty on
the parameters of the imputation model is reflected, ensuring a proper imputation.
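The data-augmentation steps described above can be sketched for the saturated multinomial model as follows. The function name, the coding of missing entries as -1 and the fixed number of iterations (used in place of a convergence check) are assumptions of the sketch. The Dirichlet parameter is updated with the cell counts of the completed table, which is the usual conjugate update and is assumed here.

```python
import numpy as np

def data_augmentation(X, q, n_iter=200, alpha=0.5, seed=0):
    """X: (n, K) integer array, -1 marks a missing entry.
    q: tuple with the number of categories of each variable.
    Returns one draw of theta (array of shape q) and a completed data set."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    q = tuple(q)
    miss = X < 0
    Xc = X.copy()
    # Initialisation: missing values are imputed by random categories.
    for k, qk in enumerate(q):
        Xc[miss[:, k], k] = rng.integers(qk, size=miss[:, k].sum())
    incomplete_rows = np.flatnonzero(miss.any(axis=1))
    for _ in range(n_iter):
        # Draw theta from the Dirichlet posterior given the completed table.
        counts = np.zeros(q)
        np.add.at(counts, tuple(Xc.T), 1)
        theta = rng.dirichlet(alpha + counts.ravel()).reshape(q)
        # Impute each incomplete row from the cell probabilities
        # conditional on its observed categories.
        for i in incomplete_rows:
            index = tuple(slice(None) if miss[i, k] else X[i, k]
                          for k in range(len(q)))
            cond = theta[index]
            cond = cond / cond.sum()
            cell = rng.choice(cond.size, p=cond.ravel())
            Xc[i, miss[i]] = np.unravel_index(cell, cond.shape)
    return theta, Xc

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 2))
X[:, 1] = np.where(rng.random(100) < 0.3, -1, X[:, 1])
theta, Xc = data_augmentation(X, q=(2, 2), n_iter=50)
print(theta)
```

Note that `theta` is stored as a full array with one entry per combination of categories, which makes the storage issue discussed next very concrete.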

The loglinear model is considered as the gold standard for MI of categorical data [15].
Indeed, this imputation model reflects all kinds of relationships between variables, which
enables applying any analysis model. However, this method is dedicated to data sets with a
small number of categories because it requires a number of independent parameters equal
to the number of combinations of categories minus 1. For example, it corresponds to
9 765 624 independent parameters for a data set with $K = 10$ variables with $q_k = 5$
categories for each of them. This involves two issues: the storage of $\theta$ and overfitting.
To overcome these issues, the model can be simplified by adding constraints on $\theta$. The
principle is to write $\log(\theta)$ as a linear combination of a restricted set of parameters
$\lambda = [\lambda_0, \lambda_{x_1}, \ldots, \lambda_{x_K}, \lambda_{x_1 x_2}, \ldots, \lambda_{x_1 x_K}, \ldots, \lambda_{x_{K-1} x_K}]$,
where each element is indexed by a category or a couple of categories. More precisely, the
constraints on $\theta$ are given by the following equation:

$$\log(\theta_{x_1 \ldots x_K}) = \lambda_0 + \sum_{k} \lambda_{x_k} + \sum_{\substack{(k,k') \\ k \neq k'}} \lambda_{x_k x_{k'}} \quad \text{for all } (X_1 = x_1, \ldots, X_K = x_K) \tag{4}$$

where the second sum is the sum over all the couples of categories possible from the set
of categories $(x_1, \ldots, x_K)$. Thus, the imputation model reflects only the simple (two-way)
associations between variables, which is generally sufficient. Equation (4) leads to 760
independent parameters for the previous example. However, although it requires a smaller
number of parameters, the imputation under the loglinear model still remains difficult in this
case, because the data-augmentation algorithm used [HI p.320] is based on a modification of
$\theta$ at each iteration and not of $\lambda$. Thus the storage issue remains.
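The two parameter counts quoted above can be checked with a few lines. The helper names are ours. The two-way count adds the $q_k - 1$ free main-effect terms of each variable and the $(q_k - 1)(q_{k'} - 1)$ free interaction terms of each pair of variables, the intercept being fixed by the sum-to-one constraint.

```python
from math import prod

def saturated_count(q):
    """Independent parameters of the saturated loglinear model."""
    return prod(q) - 1

def two_way_count(q):
    """Independent parameters when only main effects and two-way
    associations are kept, as in Equation (4)."""
    main = sum(qk - 1 for qk in q)
    pairs = sum((q[i] - 1) * (q[j] - 1)
                for i in range(len(q)) for j in range(i + 1, len(q)))
    return main + pairs

q = [5] * 10                     # K = 10 variables with 5 categories each
print(saturated_count(q))        # 9765624
print(two_way_count(q))          # 760
```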


2.2 Multiple imputation using a latent class model 

To overcome the limitation of MI using the loglinear model, another JM method based on the
latent class model can be used. The latent class model [12, p. 535] is a mixture model based
on the assumption that each individual belongs to a latent class within which all variables can
be considered as independent. More precisely, let $Z$ denote the latent categorical variable
whose values are in $\{1, \ldots, L\}$. Let $\theta_Z = (\theta_\ell)_{1 \le \ell \le L}$ denote
the proportions of the mixture and $\theta_X = (\theta_x^{(\ell)})_{1 \le \ell \le L}$ the
parameters of the $L$ components of the mixture. Thus, let $\theta = (\theta_Z, \theta_X)$
denote the parameters of the mixture; the joint distribution of the data is written as follows:

$$\mathbb{P}(X = (x_1, \ldots, x_K); \theta) = \sum_{\ell=1}^{L} \left( \mathbb{P}(Z = \ell; \theta_Z) \prod_{k=1}^{K} \mathbb{P}(X_k = x_k \mid Z = \ell; \theta_x^{(\ell)}) \right) \tag{5}$$

Assuming a multinomial distribution for $Z$ and $X_k \mid Z$, Equation (5) can be rewritten as
follows:

$$\mathbb{P}(X = (x_1, \ldots, x_K); \theta) = \sum_{\ell=1}^{L} \left( \theta_\ell \prod_{k=1}^{K} \theta_{x_k}^{(\ell)} \right) \tag{6}$$

The latent class model requires $L \times (J - K) + (K - 1)$ independent parameters, i.e. a
number that linearly increases with the number of categories.

[16] reviews in detail different multiple imputation methods using a latent class model.
These methods can be distinguished by the way used to reflect the uncertainty on the
parameters of the imputation model and by the way that the number of components of the
mixture is chosen: automatically or a priori. The quality of the imputation is quite
similar from one method to another; the main differences lie in computation time. One
of the latest contributions in this family of methods uses a non-parametric extension of the
model, namely the Dirichlet process mixture of products of multinomial distributions model
(DPMPM) [17, 18]. This method uses a fully Bayesian approach in which the number of
classes is defined automatically and is not too computationally intensive. DPMPM assumes
a prior distribution on $\theta_Z = (\theta_\ell)_{\ell \ge 1}$ without fixing the number of
classes, which is supposed to be infinite. More precisely, the prior distribution for
$\theta_Z$ is defined as follows:

$$\theta_\ell = \zeta_\ell \prod_{g < \ell} (1 - \zeta_g) \quad \text{for } \ell = 1, 2, \ldots \tag{7}$$
$$\zeta_\ell \sim \mathcal{B}(1, \alpha) \tag{8}$$
$$\alpha \sim \mathcal{G}(.25, .25) \tag{9}$$

where $\mathcal{G}$ refers to the gamma distribution, $\alpha$ is a positive real number,
$\mathcal{B}$ refers to the beta distribution; the prior distribution for $\theta_X$ is defined by:

$$\theta_x^{(\ell)} \sim \mathcal{D}(1, \ldots, 1) \tag{10}$$

corresponding to a uniform distribution over the simplex defined by the constraint of sum
to one. The posterior distribution of $\theta$ is not analytically tractable, even when no missing
values occur. However, the distribution of each parameter is known if the others are given.
For this reason, a Gibbs sampler is used to obtain a draw from the posterior distribution.
The principle of this is to draw each parameter while fixing the others. From an incomplete
data set, missing values need to be preliminarily imputed. More precisely, a draw from
the posterior distribution is obtained as follows: first, the parameters and missing values
are initialized; then, given the current parameters, particularly $\theta_Z$ and $\theta_X$, each
individual is randomly assigned to one class according to its categories; next, each parameter
($\theta_Z$, $\theta_X$, $\alpha$) is drawn conditionally on the others; finally, missing values
are imputed according to the mixture model. These steps are then repeated until convergence
(for more details, see [18]).

Despite the infinite number of classes, the prior on $\theta_\ell$ typically implies that the
posterior distribution for $\theta_\ell$ is non negligible for a finite number of classes only.
Moreover, for computational reasons, the number of classes has to be bounded. Thus, [18]
recommends fixing the maximum number of latent classes to twenty. Consequently, the simulated
values of $\theta$ are realisations of an approximated posterior distribution only.

Multiple imputation using the latent class model has the advantages and drawbacks of
this model: because the latent class model approximates quite well any kind of relationships
between variables, MI using this model enables the use of complex analysis models such as
logistic regression with some interaction terms and provides good estimates of the
parameters of the analysis model. However, the imputation model implies that given a class, each
individual is imputed in the same way, whatever the categories taken. If the class is very
homogeneous, all the individuals have the same observed values, and this behaviour makes
sense. However, when the number of missing values is high and when the number of
variables is high, it is not straightforward to obtain homogeneous classes. This can explain why
[16] observed that multiple imputation using the latent class model can lead to biased
estimates for the analysis model in such cases.
2.3 Multiple imputation using a multivariate normal distribution

Another popular strategy to perform MI for categorical data is to adapt the methods
developed for continuous data. Because multiple imputation using the multivariate normal
distribution is a robust method for imputing continuous non-normal data [H], imputation
using the multivariate normal model is an attractive method for this. The principle consists
in recoding the categorical variables as dummy variables and applying multiple
imputation under the multivariate normal distribution to the recoded data. The imputed dummy
variables are seen as a set of latent continuous variables from which categories can be
independently derived. More precisely, let $\mathbf{Z}_{I \times J}$ denote the disjunctive table
coding for $\mathbf{X}_{I \times K}$, i.e., the set of dummy variables corresponding to the
incomplete matrix. Note that one missing value on $\mathbf{x}_k$ implies $q_k$ missing values
for $\mathbf{z}_k$. The following procedure implemented in [19, 20]
enables the multiple imputation of a categorical data set using the normal distribution:


• perform a non-parametric bootstrap on $\mathbf{Z}$: sample the rows of $\mathbf{Z}$ with
replacement M times. M incomplete disjunctive tables are obtained;

• estimate the parameters of the normal distribution on each bootstrap replicate: calculate
the ML estimators of $(\mu_m, \Sigma_m)$, the mean and the variance of the normal
distribution for the $m$th bootstrap incomplete replicate, using an EM algorithm. Note that the
set of M parameters reflects the uncertainty required for a proper multiple imputation
method;

• create M imputed disjunctive tables: impute $\mathbf{Z}$ from the normal distribution
$\mathcal{N}(\mu_m, \Sigma_m)$ using the observed values of $\mathbf{Z}$. M imputed disjunctive
tables $(\mathbf{Z}_m)_{1 \le m \le M}$ are obtained. In $\mathbf{Z}_m$, the observed values are
still zeros and ones, whereas the missing values have been replaced by real numbers;

• create M imputed categorical data sets: from the latent continuous variables contained
in $(\mathbf{Z}_m)_{1 \le m \le M}$, derive categories for each incomplete individual.


Several ways have been proposed to get the imputed categories from the imputed
continuous values. For example, [21] recommends attributing the category corresponding to the
highest imputed value, while [22-24] propose some rounding strategies. However, "A single
best rounding rule for categorical data has yet to be identified." [25, p. 107]. A common
one proposed by [23] is called Coin flipping. Coin flipping consists in considering the set of
imputed values of the $q_k$ dummy variables as an expectation given the observed values:

$$\theta_k = \mathbb{E}\left[ (z_1, \ldots, z_{q_k}) \mid \mathbf{Z}_{obs}; \hat{\mu}, \hat{\Sigma} \right]$$

Thus, randomly drawing one category according to a multinomial distribution
$\mathcal{M}(\theta_k, 1)$, suitably modified so that $\theta_k$ remains between 0 and 1, imputes
plausible values. The values lower than 0 are replaced by 0 and the imputed values higher
than 1 are replaced by 1. In this case, the imputed values are scaled to respect the constraint
of sum to one.
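A minimal sketch of the coin flipping rule for the dummy variables of one categorical variable is given below. The function name is ours, and the uniform fallback used when every imputed value of a row is clipped to zero is a choice made for this sketch, not something specified in the text.

```python
import numpy as np

def coin_flip(imputed_block, rng):
    """imputed_block: (n, q_k) real-valued imputed dummies of one variable.
    Returns one drawn category index per row."""
    theta = np.clip(np.asarray(imputed_block, dtype=float), 0.0, 1.0)
    row_sums = theta.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Rescale to sum to one; uniform fallback if the whole row was clipped to 0.
    theta = np.where(row_sums > 0, theta / safe_sums, 1.0 / theta.shape[1])
    return np.array([rng.choice(theta.shape[1], p=row) for row in theta])

rng = np.random.default_rng(0)
block = np.array([[1.2, -0.3, 0.0],     # clipped to (1, 0, 0)
                  [0.2, 0.9, -0.1]])    # clipped to (0.2, 0.9, 0), then rescaled
print(coin_flip(block, rng))
```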

Because imputation under the multivariate normal distribution is based on the estimate
of a covariance matrix, the imputation under the normal distribution can detect only two-way
associations between categorical variables. In addition, this method assumes independence
between categories conditionally on the latent continuous variables. This implies that if two
variables are linked, and if an individual has missing values on both of them, then the categories
derived from the imputed disjunctive table will be drawn independently. Consequently, the
two-way associations cannot be perfectly reflected in the imputed data set. Note that,
contrary to MI using the latent class model, the parameter of the multinomial distribution
$\theta_k$ is specific to each individual, because the imputation of the disjunctive table is
performed given the observed values. This behaviour makes sense if the variables on which missing
values occur are linked with the others. The main drawback of MI using the normal
distribution is the number of independent parameters estimated. This number is equal to
$\frac{(J-K) \times (J-K+1)}{2} + (J-K)$, representing 860 parameters for a data set with 10
variables with 5 categories. It increases rapidly when the total number of categories ($J$)
increases, leading quickly to overfitting. Moreover, the covariance matrix is not invertible when
the number of individuals is lower than $(J - K)$. To overcome these issues, it is possible to add
a ridge term on its diagonal to improve the conditioning of the regression problem.
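The parameter count and the effect of a ridge term can be illustrated as follows. Both helper names and the ridge value are hypothetical; the data are random numbers standing in for $J - K$ non-redundant dummy columns observed on fewer individuals than columns.

```python
import numpy as np

def normal_param_count(J, K):
    """Covariance terms plus mean terms for the J - K free dummy variables."""
    d = J - K
    return d * (d + 1) // 2 + d

def ridge_covariance(Y, ridge=1e-2):
    """Sample covariance of the columns of Y with a ridge term on the diagonal."""
    S = np.cov(Y, rowvar=False)
    return S + ridge * np.eye(S.shape[0])

print(normal_param_count(J=50, K=10))            # 860

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))                      # 5 individuals, 8 columns
S = np.cov(Y, rowvar=False)
print(np.linalg.matrix_rank(S))                  # rank deficient: not invertible
print(np.linalg.eigvalsh(ridge_covariance(Y)).min() > 0)
```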


2.4 Fully conditional specification 

Categorical data can be imputed using an FCS approach instead of a JM approach: for each
variable with missing values, an imputation model is defined (i.e. a conditional distribution),
and each incomplete variable is sequentially imputed according to this, while reflecting the
uncertainty on the model's parameters. Typically, the models used for each incomplete
variable are multinomial logistic regressions and the variability of the models' parameters
is reflected using a Bayesian point of view. More precisely, we denote by
$\theta_k = (\theta_{k\ell})_{1 \le \ell \le q_k}$ the
set of parameters for the multinomial distribution of the variable to impute $X_k$ (the set of the
other variables is denoted $X_{-k}$). We also denote by
$\beta_k = (\beta_{k1}, \ldots, \beta_{kq_k})$ the set of regression
parameters that defines $\theta_k$, such that $\beta_{k\ell}$ is the regression parameter vector
associated with the category $\ell$ of the response variable $X_k$ and $\mathbf{Z}_{-k}$ is the
associated design matrix. Note that identifiability constraints are required on $\beta_k$, that is
why one of the vectors $\beta_{k\ell}$ (the one of a reference category) is fixed to the null vector.
Thus, the imputation is built on the following assumptions:

$$X_k \mid \theta_k \sim \mathcal{M}(\theta_k, 1) \tag{11}$$
$$\theta_{k\ell} = \mathbb{P}(X_k = \ell \mid X_{-k}, \beta_k) = \frac{\exp(\mathbf{Z}_{-k} \beta_{k\ell})}{\sum_{\ell'=1}^{q_k} \exp(\mathbf{Z}_{-k} \beta_{k\ell'})} \tag{12}$$
$$\beta_k \mid X \sim \mathcal{N}(\hat{\beta}_k, \hat{V}) \tag{13}$$

where $\hat{\beta}_k$, $\hat{V}$ are the estimators of $\beta_k$ and of its associated variance.
For simplicity, suppose that the data set contains 2 binary variables $\mathbf{x}_1$ and
$\mathbf{x}_2$, with $\mathbf{x}_2$ as incomplete and $\mathbf{x}_1$ as
complete. To impute $\mathbf{x}_2$ given $\mathbf{x}_1$, the first step is to estimate $\beta$ and
its associated variance using complete cases by iteratively reweighted least squares. Then, a new
parameter $\beta_k$ is drawn from a normal distribution centred on the previous estimate with the
covariance matrix previously obtained. Lastly, the fitted probabilities $\theta_k$ are obtained
from the logistic regression model with parameter $\beta_k$ and $\mathbf{x}_2$ is imputed
according to a multinomial distribution with parameters $\theta_k$ [25, p. 76]. Note that
$\beta$ is drawn from an approximated posterior distribution.
Indeed, as explained by [H p. 169-170], the posterior distribution does not have a neat form for
reasonable prior distributions. However, on a large sample, assuming a weak prior on $\beta$,
the posterior distribution can be approximated by a normal distribution. Thus, drawing $\beta$
from a normal distribution with $\hat{\beta}$ and $\hat{V}$ as parameters makes sense.
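The two-variable example above can be sketched with NumPy alone. The function name, the coding of categories as 0/1 with `np.nan` for missing entries and the fixed number of iterations are assumptions; the estimation step uses Newton-Raphson updates, which coincide with iteratively reweighted least squares for logistic regression. The sketch assumes no separability problem occurs, otherwise the linear system becomes singular.

```python
import numpy as np

def impute_binary_logistic(x1, x2, rng, n_irls=25):
    """x1: complete 0/1 array; x2: 0/1 array with np.nan for missing entries.
    Returns a copy of x2 where missing entries are imputed."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    miss = np.isnan(x2)
    design = np.column_stack([np.ones_like(x1), x1])
    D, y = design[~miss], x2[~miss]              # complete cases
    beta = np.zeros(D.shape[1])
    for _ in range(n_irls):                      # estimate beta
        p = 1.0 / (1.0 + np.exp(-D @ beta))
        H = D.T @ (D * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(H, D.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-D @ beta))
    V = np.linalg.inv(D.T @ (D * (p * (1.0 - p))[:, None]))   # estimated variance
    beta_star = rng.multivariate_normal(beta, V)  # draw around the estimate
    theta = 1.0 / (1.0 + np.exp(-design[miss] @ beta_star))   # fitted probabilities
    out = x2.copy()
    out[miss] = rng.random(miss.sum()) < theta    # random draw of the category
    return out

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 300)
x2 = (rng.random(300) < 0.3 + 0.4 * x1).astype(float)
x2[rng.random(300) < 0.3] = np.nan
print(np.isnan(impute_binary_logistic(x1, x2, rng)).sum())
```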

In the general case, where the data set contains $K$ variables with missing values, each
variable is imputed according to a multinomial logistic regression given all the others. More
precisely, the incomplete data set is first randomly imputed. Then, the missing values of the
variable $\mathbf{x}_k$ are imputed as explained previously: a value of $\beta_k$ is drawn from
the approximated posterior distribution and an imputation according to
$\mathbb{P}(X_k \mid X_{-k}; \theta_k)$ is performed. The next
incomplete variable is imputed in the same way given the other variables, and particularly
from the new imputed values of $\mathbf{x}_k$. We proceed in this way for all variables and
repeat it until convergence; this provides one imputed data set. The procedure is performed M times
in parallel to provide M imputed data sets.

Implicitly, the choices of the conditional distributions $\mathbb{P}(X_k \mid X_{-k}; \theta_k)$
determine a joint distribution $\mathbb{P}(X; \theta)$, in so far as a joint distribution is
compatible with these choices [26].
The convergence to the joint distribution is often obtained for a low number of iterations (5
can be sufficient), but [25, p. 113] underlines that this number can be higher in some cases.
In addition, FCS is more computationally intensive than JM [15, 25]. This is not a practical
issue when the data set is small, but it becomes so on a data set of high dimensions. In
particular, checking the convergence becomes very difficult.

The imputation using logistic regressions on each variable performs quite well, that is
why this method is often used as a benchmark to perform comparative studies [ZlilHlETlEB].
However, the limitations of multinomial regression can affect the multiple imputation procedure
using this model. Indeed, when separability problems occur ra, or when the number of
individuals is smaller than the number of categories [12, p. 195], it is not possible to get the
estimates of the parameters. In addition, the number of parameters is very large when the
number of categories per variable is high, implying overfitting when the number of
individuals is small. When the number of categories becomes too large, [251 EQ] advise using a
method dedicated to continuous data: predictive mean matching (PMM). PMM treats
each variable as a continuous variable, predicts it using linear regression, and draws one
individual among those nearest to the predicted value. However, PMM often yields
biased estimates [7].

Typically, the default models selected for each logistic regression are main effects models.
Thus, the imputation model captures the two-way associations between variables well, which
is generally sufficient for the analysis model. Models taking into account interactions
can also be used, but the choice of these models requires a certain effort by the user. To avoid
this effort, in particular when the variables are numerous, conditional imputations using
random forests instead of logistic regression have been proposed [27, 28]. According to [27],
an imputation of one variable $X_k$ given the others is obtained as follows:

• draw 10 bootstrap samples from the individuals without missing value on $X_k$;

• fit one tree on each sample: for a given bootstrap sample, draw randomly a subset of
$\sqrt{K - 1}$ variables among the $K - 1$ explanatory variables. Build one tree from this
bootstrap sample and this subset of explanatory variables. A random forest of 10 trees
is obtained. Note that the uncertainty due to missing values is reflected by the use of
one random forest instead of a unique tree;

• impute missing values on $X_k$ according to the forest: for an individual $i$ with a missing
value on $X_k$, gather all the donors from the 10 predictive leaves from each tree and
draw randomly one donor from it.

Then, the procedure is performed for each incomplete variable and repeated until convergence.
Using random forests as conditional models allows capturing complex relationships between
variables. In addition, the method is very robust to the number of trees used, as well as to
the number of explanatory variables retained. Thus, the default choices for these parameters
(10 trees, $\sqrt{K - 1}$ explanatory variables) are very suitable in most cases. However,
the method is more computationally intensive than the one based on logistic regressions.
3 Multiple Imputation using multiple correspondence analysis

This section deals with a novel MI method for categorical data based on multiple
correspondence analysis (MCA) [31, 32], i.e. the principal components method dedicated to
categorical data. Like the imputation using the normal distribution, it is a JM method based
on the imputation of the disjunctive table. We first introduce MCA as a specific singular
value decomposition on specific matrices. Then, we present how to perform this SVD with
missing values and how it is used to perform single imputation. We explain how to
introduce uncertainty to obtain a proper MI method. Finally, the properties of the method are
discussed and the differences with MI using the normal distribution highlighted.


3.1 MCA for complete data 

MCA is a principal components method to describe, summarise and visualise multidimensional
matrices with categorical data. This powerful method allows us to understand the
two-way associations between variables as well as the similarities between individuals. Like
any principal components method, MCA is a method of dimensionality reduction consisting
in searching for a subspace of dimension $S$ providing the best representation of the data in
the sense that it maximises the variability of the projected points (i.e. the individuals or the
variables according to the space considered). The subspace can be obtained by performing a
specific singular value decomposition (SVD) on the disjunctive table.

More precisely, let $Z_{I \times J}$ denote the disjunctive table corresponding to $X_{I \times K}$. We define
a metric between individuals through the diagonal matrix $\frac{1}{K} D_\Sigma^{-1}$, where
$D_\Sigma = \mathrm{diag}\left(p_1^{x_1}, \ldots, p_{q_1}^{x_1}, \ldots, p_1^{x_K}, \ldots, p_{q_K}^{x_K}\right)$ is a diagonal matrix with dimensions $J \times J$ and
$p_\ell^{x_k}$ is the proportion of observations taking the category $\ell$ on the variable $x_k$. In this way, two
individuals taking different categories for the same variable are more distant from the others
when one of them takes a rare category than when both of them take frequent categories.
We also define a uniform weighting for the individuals through the diagonal matrix $\frac{1}{I} \mathbb{1}_I$ with
$\mathbb{1}_I$ the identity matrix of dimensions $I$. By duality, the matrices $\frac{1}{K} D_\Sigma^{-1}$ and $\frac{1}{I} \mathbb{1}_I$ define also a
weighting and a metric for the space of the categories respectively. MCA consists in searching
a matrix $\hat Z$ with a lower rank $S$ as close as possible to the disjunctive table $Z$ in the sense
defined by these metrics. Let $M_{I \times J}$ denote the matrix where each row is equal to the vector
of the means of each column of $Z$. MCA consists in performing the SVD of the matrix triplet
$\left(Z - M, \frac{1}{K} D_\Sigma^{-1}, \frac{1}{I} \mathbb{1}_I\right)$ [33] which is equivalent to writing $(Z - M)$ as

$Z - M = U \Lambda^{1/2} V^\top$   (14)

where the columns of $U_{I \times J}$ are the left singular vectors satisfying the relationship
$U^\top \mathrm{diag}(1/I, \ldots, 1/I)\, U = \mathbb{1}_J$; the columns of $V_{J \times J}$ are the right singular vectors satisfying the
relationship $V^\top \frac{1}{K} D_\Sigma^{-1} V = \mathbb{1}_J$; and $\Lambda^{1/2}_{J \times J} = \mathrm{diag}\left(\lambda_1^{1/2}, \ldots, \lambda_J^{1/2}\right)$ is the diagonal matrix of
the singular values.

The $S$ first principal components are given by $\hat U_{I \times S} \hat\Lambda^{1/2}_{S \times S}$, the product between the first
columns of $U$ and the diagonal matrix $\Lambda^{1/2}$ restricted to its $S$ first elements. In the same
way, the $S$ first loadings are given by $\hat V_{J \times S}$. $\hat Z$ defined by:

$\hat Z = \hat U \hat\Lambda^{1/2} \hat V^\top + M$   (15)

is the best approximation of $Z$, in the sense of the metrics, with the constraint of rank $S$
(Eckart-Young theorem [M]). Equation (15) is called the reconstruction formula.

Note that, contrary to $Z$, $\hat Z$ is a fuzzy disjunctive table in the sense that its cells are real
numbers and not only zeros and ones as in a classic disjunctive table. However, the sum
per variable is still equal to one [35]. Most of the values are contained in the interval [0,1]
or close to it because $\hat Z$ is as close as possible to $Z$ which contains only zeros and ones, but
values out of this interval can occur.
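
To make the disjunctive coding concrete, the following Python sketch recodes a toy categorical data set into a 0/1 disjunctive table and computes the column means, i.e. the category proportions that fill each row of $M$ and the diagonal of the weighting matrix. The function names and the toy data are hypothetical and are not taken from the paper or from any package.

```python
def disjunctive_table(rows):
    """Recode categorical records into a 0/1 disjunctive table.

    rows: list of equal-length tuples, one tuple per individual.
    Returns (Z, columns) where columns[j] = (variable index, category).
    """
    n_vars = len(rows[0])
    # One dummy column per (variable, observed category), categories sorted.
    columns = [(k, c) for k in range(n_vars)
               for c in sorted({r[k] for r in rows})]
    Z = [[1 if r[k] == c else 0 for (k, c) in columns] for r in rows]
    return Z, columns


def column_proportions(Z):
    """Column means of Z: the proportion of individuals in each category."""
    n = len(Z)
    return [sum(row[j] for row in Z) / n for j in range(len(Z[0]))]


# Toy data: K = 2 variables with 3 and 2 categories, hence J = 5 columns.
X = [("a", "yes"), ("b", "no"), ("a", "no"), ("c", "yes")]
Z, columns = disjunctive_table(X)
p = column_proportions(Z)
```

Within the block of dummy columns coding one variable, each row of `Z` sums to one; this is the constraint that the fuzzy table $\hat Z$ still satisfies.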

Performing MCA requires $J - K$ parameters corresponding to the terms useful for the
centering and the weighting of the categories, $IS - S - \frac{S(S-1)}{2}$ for the centered and orthonormal
left singular vectors and $(J - K)S - S - \frac{S(S-1)}{2}$ for the orthonormal right singular vectors,
for a total of $J - K + S\left(I - 1 + (J - K) - S\right)$ independent parameters. This number of
parameters increases linearly with the number of values in the data set.
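
The total can be evaluated directly. The sketch below uses a hypothetical helper name and illustrative dimensions (not taken from the simulation study) to show how the count grows linearly with the number of individuals $I$ when $J$, $K$ and $S$ are fixed.

```python
def mca_parameter_count(I, J, K, S):
    """Independent MCA parameters: J - K + S * (I - 1 + (J - K) - S)."""
    return (J - K) + S * (I - 1 + (J - K) - S)


# Illustrative values: 4 variables, 10 categories in total, 2 dimensions.
small = mca_parameter_count(I=100, J=10, K=4, S=2)   # 212
large = mca_parameter_count(I=200, J=10, K=4, S=2)   # 412
```

Doubling $I$ from 100 to 200 adds exactly $100 \times S = 200$ parameters here, in line with the linear growth stated above.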


3.2 Single imputation using MCA 

[36] proposed an iterative algorithm called “iterative MCA” to perform single imputation 
using MCA. The main steps of the algorithm are as follows: 


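1. initialization $\ell = 0$: recode $X$ as disjunctive table $Z$, substitute missing values by initial
values (the proportions) and calculate $M^0$ and $D_\Sigma^0$ on this completed data set.

2. step $\ell$:

(a) perform the MCA, in other words the SVD of $\left(Z^{\ell-1} - M^{\ell-1}, \frac{1}{K}\left(D_\Sigma^{\ell-1}\right)^{-1}, \frac{1}{I}\mathbb{1}_I\right)$
to obtain $\hat U^\ell$, $\hat V^\ell$ and $\left(\hat\Lambda^\ell\right)^{1/2}$;

(b) keep the $S$ first dimensions and use the reconstruction formula (15) to compute
the fitted matrix:

$\hat Z^\ell_{I \times J} = \hat U^\ell_{I \times S} \left(\hat\Lambda^\ell_{S \times S}\right)^{1/2} \left(\hat V^\ell_{J \times S}\right)^\top + M^{\ell-1}_{I \times J}$

and the new imputed data set becomes $Z^\ell = W * Z + (\mathbb{1}_{I \times J} - W) * \hat Z^\ell$ with $*$ being
the Hadamard product, $\mathbb{1}_{I \times J}$ being a matrix with only ones and $W$ a weighting
matrix where $w_{ij} = 0$ if $z_{ij}$ is missing and $w_{ij} = 1$ otherwise. The observed values
are the same but the missing ones are replaced by the fitted values;

(c) from the new completed matrix $Z^\ell$, $D_\Sigma^\ell$ and $M^\ell$ are updated.

3. steps (2.a), (2.b) and (2.c) are repeated until the change in the imputed matrix falls
below a predefined threshold: $\sum_{ij} \left(\hat z^\ell_{ij} - \hat z^{\ell-1}_{ij}\right)^2 \le \varepsilon$, with $\varepsilon$ equal to $10^{-6}$ for example.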


The iterative MCA algorithm consists in recoding the incomplete data set as an incomplete
disjunctive table, randomly imputing the missing values, estimating the principal components
and loadings from the completed matrix and then, using these estimates to impute missing
values according to the reconstruction formula (15). The steps of estimation and imputation
are repeated until convergence, leading to an imputation of the disjunctive table, as well as
to an estimate of the MCA parameters.
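
The imputation step and the stopping rule of the algorithm can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the fitted matrices are given as inputs, the SVD itself is not implemented, and all names are hypothetical.

```python
def update_imputed(Z, Z_hat, W):
    """Keep observed cells of Z (W = 1) and use fitted cells of Z_hat elsewhere (W = 0)."""
    return [[z if w == 1 else zh for z, zh, w in zip(rz, rzh, rw)]
            for rz, rzh, rw in zip(Z, Z_hat, W)]


def squared_change(A, B):
    """Sum of squared cell-wise differences between two matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Second individual is missing on a two-category variable.
Z = [[1, 0], [None, None]]
W = [[1, 1], [0, 0]]
Z_hat_prev = [[0.90, 0.10], [0.30, 0.70]]   # fitted matrix at step l - 1
Z_hat_new = [[0.90, 0.10], [0.25, 0.75]]    # fitted matrix at step l

Z_new = update_imputed(Z, Z_hat_new, W)
converged = squared_change(Z_hat_new, Z_hat_prev) <= 1e-6
```

Here the observed row is left untouched, the missing row takes the fitted values, and the change of about 0.005 between the two fitted matrices is still above the threshold, so another iteration would be run.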

The algorithm can suffer from overfitting issues when missing values are numerous, when
the relationships between variables are weak, or when the number of observations is low. To
overcome these issues, a regularized version of it has been proposed [36]. The rationale is
to remove the noise in order to avoid instabilities in the prediction by replacing the singular
values $\left(\sqrt{\hat\lambda_s^\ell}\right)_{1 \le s \le S}$ of step (2.b) by shrunk singular values

$\left(\frac{\hat\lambda_s^\ell - \frac{\sum_{s'=S+1}^{J-K} \hat\lambda_{s'}^\ell}{J-K-S}}{\sqrt{\hat\lambda_s^\ell}}\right)_{1 \le s \le S}$.

In this way, singular values are thresholded with a greater amount of shrinkage for the smallest ones.
Thus, the first dimensions of variability take a more significant part in the reconstruction
of the data than the others. Assuming that the first dimensions of variability are made of
information and noise, whereas the last ones are made of noise only, this behaviour is then
satisfactory. Geometrically, the regularization makes the individual closer to the center of
gravity. Concerning the cells of $\hat Z$, the regularization makes the values closer to the mean
proportions and consequently, these values are more often in the interval [0,1].
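
The effect of the shrinkage can be checked numerically. The sketch below assumes that each retained singular value $\sqrt{\lambda_s}$ is replaced by $(\lambda_s - \hat\sigma^2)/\sqrt{\lambda_s}$, where $\hat\sigma^2$ is the mean of the discarded eigenvalues; the function name and the eigenvalues are hypothetical.

```python
import math


def shrunk_singular_values(eigenvalues, S):
    """Shrink the S first singular values.

    eigenvalues: the J - K eigenvalues lambda_s, in decreasing order.
    The noise level is the mean of the eigenvalues beyond the S first ones.
    """
    n = len(eigenvalues)
    if not 0 < S < n:
        raise ValueError("S must satisfy 0 < S < number of eigenvalues")
    noise = sum(eigenvalues[S:]) / (n - S)
    return [(lam - noise) / math.sqrt(lam) for lam in eigenvalues[:S]]


lambdas = [4.0, 1.0, 0.5, 0.5]
raw = [math.sqrt(lam) for lam in lambdas[:2]]      # [2.0, 1.0]
shrunk = shrunk_singular_values(lambdas, S=2)      # [1.75, 0.5]
```

The first singular value keeps 87.5% of its size while the second keeps only 50%, which illustrates the greater amount of shrinkage applied to the smallest ones.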


The regularized iterative MCA algorithm enables us to impute an incomplete disjunctive
table but not an initial incomplete data set. A strategy to go from the imputed disjunctive
table to an imputed categorical data set is required. We also suggest the use of the coin
flipping approach. Let us note that for each set of dummy variables coding for one categorical
variable, the sum per row is equal to one, even if it contains imputed values. Moreover, most
of the imputed cells are in the interval [0,1] or are close to it. Consequently, modifications
of these cells are not often required.


3.3 MI using MCA 


To perform MI using MCA, we need to reflect the uncertainty concerning the principal
components and loadings. To do so, we use a non-parametric bootstrap approach based on
the specificities of MCA. Indeed, as seen in Section 3.1, MCA enables us to assign a weight
to each individual. This possibility to include a weight for the individual is very useful when
the same lines of the data set occur several times. Instead of storing each replicate, a weight
proportional to the number of occurrences of each line can be used, allowing the storage
only of the lines that are different. Thus, a non-parametric bootstrap, such as the one used
for the MI using the normal distribution, can easily be performed simply by modifying the
weight of the individuals: if an individual does not belong to the bootstrap replicate, then its
weight is null, otherwise, its weight is proportional to the number of times the observation
occurs in the replicate. Note that individuals with a weight equal to zero are classically called
supplementary individuals in the MCA framework [33].
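
A minimal Python sketch of this weighting scheme is given below. It only illustrates how a bootstrap replicate translates into individual weights; the function name is hypothetical and the normalisation (weights summing to one) is an assumption made for the example.

```python
import random
from collections import Counter


def bootstrap_weights(n_individuals, rng):
    """Draw n indices with replacement and return one weight per individual.

    The weight is proportional to the number of times the individual is drawn;
    individuals that are never drawn get a weight of zero.
    """
    draws = [rng.randrange(n_individuals) for _ in range(n_individuals)]
    counts = Counter(draws)
    return [counts[i] / n_individuals for i in range(n_individuals)]


rng = random.Random(2024)          # seeded only to make the example repeatable
weights = bootstrap_weights(10, rng)
supplementary = [i for i, w in enumerate(weights) if w == 0]
```

The indices collected in `supplementary` correspond to the individuals that do not belong to the replicate and therefore do not contribute to the estimation of the MCA parameters.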

Thus, we define a MI method called multiple imputation using multiple correspondence
analysis (MIMCA). First, the algorithm consists in drawing $M$ sets of weights for the
individuals. Then, $M$ single imputations are performed: at first, the regularized iterative MCA
algorithm is used to impute the incomplete disjunctive table using the previous weighting
for the individuals; next, coin flipping is used to obtain categorical data and mimic the
distribution of the categorical data. At the end, $M$ imputed data sets are obtained and any
statistical method can be applied on each one. In detail, the MIMCA algorithm is written
as follows:


1. Reflect the variability on the set of parameters of the imputation model: draw $I$ values
with replacement in $\{1, \ldots, I\}$ and define a weight $r_i$ for each individual proportional to
the number of times the individual $i$ is drawn.

2. Impute the disjunctive table according to the previous weighting:

(a) initialization $\ell = 0$: recode $X$ as a disjunctive table $Z$, substitute missing values
by initial values (the proportions) and calculate $M^0$ and $D_\Sigma^0$ on this completed
data set.

(b) step $\ell$:

i. perform the SVD of $\left(Z^{\ell-1} - M^{\ell-1}, \frac{1}{K}\left(D_\Sigma^{\ell-1}\right)^{-1}, \mathrm{diag}(r_1, \ldots, r_I)\right)$ to obtain
$\hat U^\ell$, $\hat V^\ell$ and $\left(\hat\Lambda^\ell\right)^{1/2}$

ii. keep the $S$ first dimensions and compute the fitted matrix:

$\hat Z^\ell = \hat U^\ell \left(\hat\Lambda^\ell_{shrunk}\right)^{1/2} \left(\hat V^\ell\right)^\top + M^{\ell-1}$

where $\left(\hat\Lambda^\ell_{shrunk}\right)^{1/2}$ is the diagonal matrix containing the shrunk singular values,
and derive the new imputed data set $Z^\ell = W * Z + (\mathbb{1} - W) * \hat Z^\ell$

iii. from the new completed matrix $Z^\ell$, $D_\Sigma^\ell$ and $M^\ell$ are updated.

(c) step (2.b) is repeated until convergence.

3. Mimic the distribution of the categorical data set using coin flipping on $Z^\ell$:

(a) if necessary, modify suitably the values of $Z^\ell$: negative values are replaced by zero,
and values higher than one are replaced by one. Then, for each set of dummy
variables coding for one categorical variable, scale in order to verify the constraint
that the sum is equal to one.

(b) for imputed cells coding for one missing value, draw one category according to a
multinomial distribution.

4. Create $M$ imputed data sets: for $m$ from 1 to $M$ alternate steps 1, 2 and 3.
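
Step 3 of the algorithm can be sketched as follows for the block of dummy columns coding one missing value. The function name, the toy fitted values and the uniform fallback for a block that is entirely clipped to zero are assumptions made for the illustration, not part of the published algorithm.

```python
import random


def coin_flip(fitted_block, categories, rng):
    """Draw one category from the fitted values of one variable's dummy block.

    Step 3(a): clip to [0, 1], then rescale so that the block sums to one.
    Step 3(b): draw a single category from the resulting probabilities.
    """
    clipped = [min(1.0, max(0.0, v)) for v in fitted_block]
    total = sum(clipped)
    if total == 0:
        # Assumption: fall back to a uniform draw if every cell was clipped to zero.
        probs = [1.0 / len(clipped)] * len(clipped)
    else:
        probs = [v / total for v in clipped]
    drawn = rng.choices(categories, weights=probs, k=1)[0]
    return drawn, probs


rng = random.Random(1)             # seeded only to make the example repeatable
category, probs = coin_flip([-0.1, 0.3, 0.8], ["a", "b", "c"], rng)
```

With these fitted values, category "a" receives probability zero after clipping, so the imputed value is either "b" or "c", with "c" the more likely draw.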

3.4 Properties of the imputation method 

MI using MCA is part of the family of joint modelling MI methods, which means that it
avoids the runtime issues of conditional modelling. Most of the properties of the MIMCA
method are directly linked to MCA properties. MCA provides an efficient summary of
the two-way associations between variables, as well as the similarities between individuals.
The imputation benefits from these properties and provides an imputation model that is
sufficiently complex for the subsequent analysis model when it focuses on two-way associations
between variables, such as a main effects logistic regression model. In addition, like the MI using the normal
distribution, MIMCA uses draws from a multinomial distribution with parameter $\theta$ (obtained
by the disjunctive table) specific to each individual and depending on the observed values of
the other variables. Lastly, because of the relatively small number of parameters required to
perform MCA, the imputation method works well even if the number of individuals is small.
These properties have been highlighted in previous works on imputation using principal
components methods [iniEZ].

Since these two methods, MIMCA and the multiple imputation with the normal distribution,
provide several imputations of the disjunctive table, and then use the same strategy
to go from the disjunctive table to the categorical data set, they seem very close. However,
they differ on many other points.

The first one is that the imputation of the disjunctive table by MCA is a deterministic
imputation, replacing a missing value by the most plausible value given by the estimate of
the principal components and the estimate of the loadings. Then, coin flipping is used to
mimic the distribution of the categorical data. On the contrary, the multiple imputation
based on the normal distribution uses stochastic regressions to impute the disjunctive table,
that is to say, a Gaussian noise is added to the conditional expectation given by the observed
values. Then, coin flipping is used, adding uncertainty a second time.

The second difference between the two methods is the covariance of the imputed values.
Indeed, the matrix $\hat Z$ contains the reconstructed data by the iterative MCA algorithm and
the product $\hat Z^\top \hat Z$ provides the covariance matrix of this data. Its rank is $S$. On the
contrary, the rank of the covariance matrix used to perform imputation using the normal
distribution is $J - K$ (because of the constraint of the sum equal to one per variable).
Consequently, the relationships between imputed variables are different.

The third difference is the number of estimated parameters. Indeed, although the imputation
by the normal distribution requires an extremely large number of parameters when the
number of categories increases, the imputation using MCA requires a number of parameters
linearly dependent on the number of cells. This property is essential from a practical point of
view because it makes it very easy to impute data sets with a small number of individuals.


4 Simulation study 

As mentioned in the introduction, the aim of MI methods is to obtain an inference on a
quantity of interest $\psi$. Here, we focus on the parameters of a logistic regression without
interaction, which is a statistical method frequently used for categorical data. At first, we
present how to make inference for the parameters from multiple imputed data sets. Then,
we explain how we assess the quality of the inference built, that is to say, the quality of the
MI methods. Finally, the MI methods presented in Sections 2 and 3 are compared through a
simulation study based on real data sets. It thus provides more realistic performances from
a practical point of view. The code to reproduce all the simulations with the R software |3H] ,
as well as the data sets used, are available on the webpage of the first author.


4.1 Inference from imputed data sets 

Each MI method gives $M$ imputed data sets as outputs. Then, the parameters of the analysis
model (for instance the logistic regression) as well as their associated variance are estimated
from each one. We denote $\left(\hat\psi_m\right)_{1 \le m \le M}$ the set of the $M$ estimates of the model's parameters
and we denote $\left(\widehat{Var}\left(\hat\psi_m\right)\right)_{1 \le m \le M}$ the set of the $M$ associated variances. These estimates
have to be pooled to provide a unique estimate of $\psi$ and of its variance using Rubin's rules

0 

This methodology is explained for a scalar quantity of interest $\psi$. The extension to a
vector is straightforward, proceeding in the same way element by element. The estimate of
$\psi$ is simply given by the mean over the $M$ estimates obtained from each imputed data set:

$\hat\psi = \frac{1}{M} \sum_{m=1}^{M} \hat\psi_m$   (16)

while the estimate of the variance of $\hat\psi$ is the sum of two terms:

$\widehat{Var}\left(\hat\psi\right) = \frac{1}{M} \sum_{m=1}^{M} \widehat{Var}\left(\hat\psi_m\right) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat\psi_m - \hat\psi\right)^2$.   (17)

The first term is the within-imputation variance, corresponding to the sampling variance.
The second one is the between-imputation variance, corresponding to the variance due to
missing values. The factor $\left(1 + \frac{1}{M}\right)$ is due to the fact that $\hat\psi$ is estimated from a finite number
of imputed tables.

Then, the 95% confidence interval is calculated as:

$\hat\psi \pm t_{\nu, .975} \sqrt{\widehat{Var}\left(\hat\psi\right)}$

where $t_{\nu, .975}$ is the .975 critical value of the Student's t-distribution with $\nu$ degrees of freedom
estimated as suggested by [5U] .
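
The pooling step can be illustrated with a short Python sketch implementing equations (16) and (17) for a scalar quantity. The function name and the numerical values are hypothetical; the degrees of freedom and the resulting confidence interval are deliberately left out, since they depend on the correction chosen.

```python
def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances (equations (16) and (17)).

    Returns (pooled estimate, within variance, between variance, total variance).
    """
    M = len(estimates)
    if M < 2 or len(variances) != M:
        raise ValueError("need at least two estimates and matching variances")
    pooled = sum(estimates) / M
    within = sum(variances) / M
    between = sum((e - pooled) ** 2 for e in estimates) / (M - 1)
    total = within + (1 + 1 / M) * between
    return pooled, within, between, total


# Three imputed data sets, each giving an estimate and its variance.
pooled, within, between, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With these inputs the pooled estimate is 2.0, the within-imputation variance 0.5 and the between-imputation variance 1.0, so the total variance is $0.5 + (1 + 1/3) \times 1.0 \approx 1.83$.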

4.2 Simulation design from real data sets 

The validity of MI methods is often assessed by simulation [23 P-47]. We design a simulation
study using real data sets to assess the quality of the MIMCA method. Each data set is
considered as a population data and denoted $X_{pop}$. The parameters of the logistic regression
model are estimated from this population data and they are considered as the true
coefficients $\psi$. Then, a sample $X$ is drawn from the population. This step reflects the sampling
variance. The values of the response variable of the logistic model are drawn according to
the probabilities defined by $\psi$. Then, incomplete data are generated completely at
random to reflect the variance due to missing values m. The MI methods are applied and
the inferences are performed. This procedure is repeated $T$ times.

The performances of a MI method are measured according to three criteria: the bias given
by $\frac{1}{T} \sum_{t=1}^{T} \left(\hat\psi_t - \psi\right)$, the median (over the $T$ simulations) of the confidence intervals width
as well as the coverage. The latter is calculated as the percentage of cases where the true
value $\psi$ is within its 95% confidence interval.
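
The three criteria can be computed as in the following Python sketch, given the $T$ pooled estimates and their confidence intervals for one parameter. The function name and the toy numbers are hypothetical.

```python
from statistics import median


def evaluate(true_value, estimates, intervals):
    """Bias, median confidence interval width and coverage (in %) over T simulations.

    intervals: list of (lower, upper) bounds, one per simulation.
    """
    T = len(estimates)
    bias = sum(e - true_value for e in estimates) / T
    width = median(upper - lower for lower, upper in intervals)
    covered = sum(lower <= true_value <= upper for lower, upper in intervals)
    return bias, width, 100.0 * covered / T


estimates = [0.9, 1.1, 1.3, 0.7]
intervals = [(0.5, 1.3), (0.8, 1.4), (1.05, 1.55), (0.4, 1.0)]
bias, width, coverage = evaluate(1.0, estimates, intervals)
```

In this toy example the estimates are centred on the true value, so the bias is essentially zero; three of the four intervals contain the true value, which gives a coverage of 75%.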

A coverage sufficiently close to the nominal level is required to consider that the inference
is correct, but it is not sufficient: the confidence interval width should also be as small as possible.

To appreciate the value of the bias and of the width of the confidence interval, it is
useful to compare them to those obtained from two other methods. The first one consists
in calculating the criteria for the data sets without missing values, which we named the
“Full data” method. The second one is the listwise deletion. This consists in deleting the
individuals with missing values. Because the estimates of the parameters of the model are
obtained from a subsample, the confidence intervals obtained should be larger than those
obtained from multiple imputation.

A single imputation method (named Sample) is added as a benchmark to understand
better how MI methods benefit from using the relationships between variables to impute
the data. This single imputation method consists in drawing each category according to a
multinomial distribution $\mathcal{M}(\theta, 1)$, with $\theta$ defined according to the proportion of each category
of the current variable.
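
A possible reading of this benchmark is sketched below in Python: each missing entry of a variable is replaced by a draw from the observed category proportions of that variable, ignoring all the other variables. The function name is hypothetical and missing values are coded as `None` for the example.

```python
import random
from collections import Counter


def sample_impute(values, rng):
    """Impute missing entries (None) of one categorical variable.

    Each missing entry is drawn from the categories observed on this variable,
    with probabilities equal to their observed proportions.
    """
    counts = Counter(v for v in values if v is not None)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    return [v if v is not None else rng.choices(categories, weights=weights, k=1)[0]
            for v in values]


rng = random.Random(0)             # seeded only to make the example repeatable
observed = ["a", None, "b", "a", None]
completed = sample_impute(observed, rng)
```

Because the draw does not depend on the other variables, this benchmark weakens the associations between variables, which is what the comparison with the MI methods is meant to reveal.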

4.3 Results 

The methods described in this paper are performed using the following R packages: cat [H]
for MI using the saturated loglinear model, Amelia [T^[2n] for MI using a normal distribution,
mi ws for MI using the DPMPM method, mice [301H3] for the FCS approach using iterated
logistic regressions and random forests. This package will also be used to pool the results
from the imputed data sets. The tuning parameters of each MI method are chosen according
to their default values implemented in the R packages. Firstly, the tuning parameter of the
MIMCA method, that is to say, the number of components, is chosen to provide accurate
inferences. Its choice will be discussed later in Section 4.3.3.

The MI methods are assessed in terms of the quality of the inference as well as the time 
consumed from data sets covering many situations. The data sets differ in terms of the 
number of individuals, the number of variables, the number of categories per variable, the 
relationships between variables. 

The evaluation is based on the following categorical data sets. For each data set a
categorical response variable is available.

• Saheart: This data set m provides clinical attributes of $I_{pop} = 462$ males of the
Western Cape in South Africa. These attributes can explain the presence of a coronary
heart disease. The data set contains $K = 10$ variables with a number of categories
between 2 and 4.

• Galetas: This data set [15] refers to the preferences of $I_{pop} = 1192$ judges regarding
11 cakes in terms of global appreciation and in terms of color aspect. The data set
contains $K = 4$ variables with two that have 11 categories.

• Sbp: The $I_{pop} = 500$ subjects of this data set are described by clinical covariates
explaining their blood pressure [46]. The data set contains $K = 18$ variables that have
2 to 4 categories.

• Income: This data set, from the R package kernlab m, contains $I_{pop} = 6876$ individuals
described by several demographic attributes that could explain the annual income of
a household. The data set contains $K = 14$ variables with a number of categories
between 2 and 9.

• Titanic: This data set jlH] provides information on $I_{pop} = 2201$ passengers on the ocean
liner Titanic. The $K = 4$ variables deal with the economic status, the sex, the age and
the survival of the passengers. The first variable has four categories, while the other
ones have two categories. The data set is available in the R software.

• Credit: German Credit Data from the UCI Repository of Machine Learning Database
[49] contains $I_{pop} = 982$ clients described by several attributes which enable the bank to
classify them as good or bad credit risk. The data set contains $K = 20$ variables
with a number of categories between 2 and 4.

The simulation design is performed for $T = 200$ simulations and 20% of missing values
generated completely at random. The MI methods are performed with $M = 5$ imputed data
sets which is usually enough [4|.

4.3.1 Assessment of the inferences 

First of all, we can note that some methods cannot be applied on all the data sets. As 
explained previously, MI using the loglinear model can be applied only on data sets with 
a small number of categories such as Titanic or Galetas. MI using the normal distribution 
encounters inversion issues when the number of individuals is small compared to the number 
of variables. That is why no results are provided for MI using the normal distribution on the 
data sets Credit and Sbp. The other MI methods can be applied on all the data sets.

For each data set and each method, the coverages of all the confidence intervals of the
parameters of the model are calculated from $T$ simulations (see Table in Appendix for
more details on these models). All the coverages are summarized with a boxplot (see Figure
2). The results for the bias and the confidence interval width are presented in Figure]^ and
5 in Appendix

Figure 2: Distribution of the coverages of the confidence intervals for all the parameters, for several
methods (Listwise deletion, Sample, Loglinear model, Normal distribution, DPMPM,
MIMCA, FCS using logistic regressions, FCS using random forests, Full data) and for
different data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). The horizontal
dashed line corresponds to the lower bound of the 95% confidence interval for a
proportion of 0.95 from a sample of size 200 according to the Agresti-Coull method [50].
Coverages under this line are considered as undesirable.
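
The position of the dashed line can be recomputed with a few lines of Python. The sketch below uses the usual Agresti-Coull adjustment (adding $z^2/2$ successes and $z^2$ trials before applying the Wald formula); the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_lower(p, n, level=0.95):
    """Lower bound of the Agresti-Coull interval for a proportion p observed on n trials."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n_tilde = n + z ** 2
    p_tilde = (p * n + z ** 2 / 2) / n_tilde
    return p_tilde - z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)


# Nominal coverage of 0.95 assessed over T = 200 simulations.
lower = agresti_coull_lower(0.95, 200)
```

For a proportion of 0.95 and 200 simulations, the bound is roughly 0.91, so a coverage below about 91% is flagged as undesirable.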


As expected, MI using the loglinear model performs well on the two data sets where it
can be applied. The coverages are close to the nominal levels, the biases are close to zero,
and the confidence interval widths are small.

MI using the non-parametric version of the latent class model performs quite well since
most of the quantities of interest have a coverage close to 95%. However, some inferences
are incorrect from time to time such as on the data set Credit or Titanic. This behaviour
is in agreement with the study of [IH] which also presents some unsatisfactory coverages.
[TB] note that this MI model can have some difficulties in capturing the associations among
the variables, particularly when the number of variables is high or the relationships between
variables are complex, which can explain the poor coverages observed. Indeed, on the data
set Credit, the number of variables is the highest among the data sets considered, while on
the data set Titanic, the relationships between variables can be described as complex, in
the sense that the survival status of the passengers is linked to all the other variables, but
these are not closely connected. Moreover, the very poor coverages for the method Sample
indicate that the imputation model has to take into account these relationships to provide
confidence intervals that reach the nominal rate.

MI using the normal distribution can be applied on three data sets only. On these data
sets, the coverages can be too small (see Titanic in Figure 2). This highlights that despite
the fact that this method is still often used in practice to deal with incomplete categorical
data, it is not suitable and we do not recommend using such a strategy. However, [B] showed
that this method could be used to impute mixed data (i.e. with continuous and categorical
data) when only continuous variables contain missing values.

The FCS using logistic regressions encounters difficulties on the data sets with a high 
number of categories such as Galetas and Income. This high number of categories implies a 
high number of parameters for each conditional model that may explain the undercoverage 
on several quantities. 

The FCS using random forests performs well and the method encounters difficulties only
on the Titanic data set. This behaviour can be explained by the step of subsampling variables
in the imputation algorithm (Section 2.4), i.e., each tree is built with potentially different
variables and with a smaller number than $K - 1$. In the Titanic data set, the number
of variables is very small, the relationships between the variables are weak and all the
variables are important to predict the survival response. Thus, it introduces too much bias
in the individual tree prediction, which may explain the poor inference. Even if, in
most practical cases, MI using random forests is very robust to the misspecification of the
parameters, on this data set the inference could be improved by increasing the number of
explanatory variables retained for each tree.

Concerning MI using MCA, all the coverages observed are satisfactory. The confidence
interval width is of the same order of magnitude as that of the other MI methods. In addition,
the method can be applied whatever the number of categories per variable, the number of
variables or the number of individuals. Thus, it appears to be the easiest method to use to
impute categorical data.


4.3.2 Computational efficiency 

MI methods can be time consuming and the running time of the algorithms could be
considered as an important property of a MI method from a practical point of view. Table 1
gathers the times required to impute $M = 5$ times the data sets with 20% of missing values.



              Saheart    Galetas    Sbp        Income      Titanic    Credit
Loglinear     NA         4.597      NA         NA          0.740      NA
DPMPM         20.050     17.414     56.302     143.652     10.854     24.289
Normal        0.920      0.822      NA         26.989      0.483      NA
MIMCA         5.014      8.972      7.181      58.729      2.750      8.507
FCS log       20.429     38.016     53.109     881.188     4.781      56.178
FCS forests   91.474     112.987    193.156    6329.514    265.771    461.248

Table 1: Time consumed (in seconds) to impute data sets (Saheart, Galetas, Sbp, Income, Titanic, 
Credit), for different methods (Loglinear model, DPMPM, Normal distribution, MIMCA, 
FCS using logistic regressions, FCS using random forests). The imputation is done for 
M = 5 data sets. Calculation has been performed on an Intel(R) Core™2 Duo CPU E7500,
running Ubuntu 12.04 LTS equipped with 3 GB of RAM. Some values are not provided
because not all methods can be performed on each data set.

First of all, as expected, the FCS method is more time consuming than the others based
on a joint model. In particular, for the data set Income, where the number of individuals
and variables is high, the FCS using random forests requires 6,329 seconds (i.e. 1.75 hours),
illustrating substantial running time issues. FCS using logistic regressions requires 881
seconds, a time 6 times higher than MI using the latent class model, and 15 times higher than
the MI method using MCA. Indeed, the number of incomplete variables increases the number of
conditional models required, as well as the number of parameters in each of them because
more covariates are used. In addition, the time required to estimate these parameters is
non-negligible, particularly when the number of individuals is high. Then, MI using the latent
class model can be performed in a reasonable time, but this is at least two times higher
than the one required for MI using MCA. Thus, the MIMCA method should be particularly
recommended to impute data sets of high dimensions.

Having a method which is not too expensive enables the user to produce more than the 
classical M = 5 imputed data sets. This could lead to a more accurate inference. 


4.3.3 Choice of the number of dimensions 


MCA requires a predefined number of dimensions $S$ which can be chosen by cross-validation
|3B] . Cross-validation consists in searching the number of dimensions $S$ minimizing an error
of prediction. More precisely, missing values are added completely at random to the data set
$X$. Then, the missing values of the incomplete disjunctive table $Z$ are predicted using the
regularized iterative MCA algorithm. The mean error of prediction is calculated according to
$\frac{1}{Card(\mathcal{U})} \sum_{(i,j) \in \mathcal{U}} \left(z_{ij} - \hat z_{ij}\right)^2$ where $\mathcal{U}$ denotes the set of the added missing values. The
procedure is repeated $k$ times for a predefined number of dimensions. The number of dimensions
retained is the one minimizing the mean of the $k$ mean errors of prediction. This procedure
can be used whether the data set contains missing values or not.
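
The selection rule can be sketched as follows in Python. The prediction of the held-out cells by the regularized iterative MCA algorithm is not implemented here: the errors are taken as given, and all names and numbers are hypothetical.

```python
def mean_prediction_error(Z_true, Z_pred, held_out):
    """Mean squared error over the cells (i, j) that were set to missing."""
    return sum((Z_true[i][j] - Z_pred[i][j]) ** 2 for i, j in held_out) / len(held_out)


def select_dimension(errors_by_S):
    """Return the S with the smallest mean error over the k repetitions.

    errors_by_S: dict mapping each candidate S to its list of k mean errors.
    """
    mean_error = {S: sum(errs) / len(errs) for S, errs in errors_by_S.items()}
    return min(mean_error, key=mean_error.get)


# One repetition: two held-out cells of a small disjunctive table.
Z_true = [[1, 0], [0, 1]]
Z_pred = [[0.8, 0.2], [0.4, 0.6]]
error = mean_prediction_error(Z_true, Z_pred, [(0, 0), (1, 1)])

# k = 3 repetitions for three candidate numbers of dimensions.
best_S = select_dimension({1: [0.30, 0.28, 0.32], 2: [0.20, 0.22, 0.21], 3: [0.25, 0.24, 0.26]})
```

In this toy example the error of the single repetition is $(0.2^2 + 0.4^2)/2 = 0.1$, and the candidate $S = 2$ is retained because it has the smallest average error.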

To evaluate how the choice of $S$ impacts on the quality of the inferences, we perform
the MIMCA algorithm varying the number of dimensions around the one provided by
cross-validation. Figure 3 presents how this tuning parameter influences the coverages in the
previous study. The impacts on the width of the confidence intervals are reported in Figure
[^and the ones on the bias in Figure in Appendix [Bj 

Except for the data set Titanic, the coverages are stable according to the number of
dimensions retained. In particular, the number of dimensions suggested by cross-validation
provides coverages close to the nominal level of the confidence interval. In the case of the data
set Titanic, the cross-validation suggests retaining 5 dimensions, which is the choice giving
the smallest confidence intervals, while giving coverages close to 95%. But retaining fewer
dimensions leads to worse performances since the covariates are not closely related (Section
4.3.1). Indeed, these covariates cannot be well represented within a space of low dimensions.
Consequently, a high number of dimensions is required to reflect the useful associations to
impute the data. Titanic illustrates that underfitting can be problematic. The same comment
is made by na who advise choosing a number of classes sufficiently high in the case of MI
using the latent class model. However, overfitting is less problematic because it increases the
variance, but it does not discard the useful information.


Figure 3: Distribution of the coverages of the confidence intervals for all the parameters for the
MIMCA algorithm for several numbers of dimensions and for different data sets (Saheart,
Galetas, Sbp, Income, Titanic, Credit). The results for the number of dimensions
provided by cross-validation are in grey. The horizontal dashed line corresponds to the
lower bound of the 95% confidence interval for a proportion of 0.95 from a sample of size
200 according to the Agresti-Coull method [50]. Coverages under this line are considered
as undesirable.


5 Conclusion

This paper proposes an original MI method to deal with categorical data based on MCA.
The principal components and the loadings, which are the parameters of MCA, enable the
imputation of data. To perform MI, the uncertainty on these parameters is reflected using a
non-parametric bootstrap, which results in a specific weighting for the individuals.
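
To illustrate why a non-parametric bootstrap amounts to a weighting of the individuals, the
following sketch draws individuals with replacement and counts how often each one is
selected. The data set size, the number of imputations and the seed are hypothetical, and the
way the weights subsequently enter the MCA is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_individuals = 8                # hypothetical small data set
n_imputations = 5                # M, the number of imputed data sets

weights = []
for _ in range(n_imputations):
    # Draw n individuals with replacement (non-parametric bootstrap)
    drawn = rng.integers(0, n_individuals, size=n_individuals)
    # Counting how often each individual is drawn gives its weight
    counts = np.bincount(drawn, minlength=n_individuals)
    weights.append(counts / n_individuals)

weights = np.array(weights)      # one row of weights per imputed data set
print(weights)
```

Each row sums to one, and an individual that is never drawn receives a weight of zero in the
corresponding replicate.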

From a simulation study based on real data sets, this MI method has been compared to
the other main available MI methods for categorical variables. We highlighted the
competitiveness of the MIMCA method to provide valid inferences for an analysis model requiring
two-way associations (such as logistic regression without interaction, or a homogeneous
loglinear model, proportion, odds ratios, etc.).

We showed that MIMCA can be applied to various configurations of data. In particular,
the method is accurate for a large number of variables, for a large number of categories per
variable and when the number of individuals is small. Moreover, the MIMCA algorithm
performs fairly quickly, allowing the user to generate more imputed data sets and therefore to
obtain more accurate inferences (M between 20 and 100 can be beneficial [2S1 P-49]). Thus,
MIMCA is very suitable to impute data sets of high dimensions that require more
computation. Note that MIMCA depends on a tuning parameter (the number of components), but
we highlighted that the performances of the MI method are robust to a misspecification of
it.
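
One way to see why a larger number M of imputed data sets can help is through the usual
pooling rules of Rubin, where the between-imputation variance is inflated by a factor
(1 + 1/M). The sketch below assumes these standard rules; the function name and the numerical
values are purely illustrative and are not results from the simulation study.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool M point estimates and their within-imputation variances."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w = variances.mean()                # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / m) * b             # total variance
    return q_bar, t


# Hypothetical estimates of one coefficient over M = 5 imputed data sets
q_bar, t = pool_rubin([0.50, 0.54, 0.46, 0.52, 0.48], [0.010] * 5)
print(q_bar, t)
```

The factor (1 + 1/M) decreases towards one as M grows, and the between-imputation variance
itself is estimated from more replicates.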


Because of the intrinsic properties of MCA, MI using MCA is appropriate when the
analysis model contains two-way associations between variables, such as logistic regression
without interaction. To consider the case with interactions, one solution could be to introduce
into the data set additional variables corresponding to the interactions. However, the new
variable "interaction" is considered as a variable in itself without taking into account its
explicit link with the associated variables. It may lead to imputed values which are not
in agreement with each other. This topic is a subject of intensive research for continuous
variables [51, 52].
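
The kind of disagreement mentioned above can be made concrete with a small check. In this
sketch, the variable names A, B and A_B and the imputed rows are hypothetical: the interaction
column is treated as a variable in itself, so nothing forces its imputed category to match the
categories imputed for A and B.

```python
def interaction_label(a, b):
    """Category of the interaction variable implied by a and b."""
    return f"{a}:{b}"


def is_consistent(row):
    """Check that the interaction column agrees with its parent variables."""
    return row["A_B"] == interaction_label(row["A"], row["B"])


# Hypothetical imputed rows: "A_B" was imputed as a variable in itself
imputed_rows = [
    {"A": "yes", "B": "low", "A_B": "yes:low"},    # coherent imputation
    {"A": "no", "B": "high", "A_B": "yes:high"},   # contradicts A = "no"
]
flags = [is_consistent(row) for row in imputed_rows]
print(flags)
```

The second row is flagged as inconsistent because the imputed interaction category implies a
value of A that differs from the one imputed for A itself.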

In addition, the encouraging results of MIMCA to impute categorical data prompt
the extension of the method to impute mixed data. The first research in this direction
has shown that the principal components method dedicated to mixed data (called Factorial
Analysis for Mixed Data) is efficient to perform single imputation, but the extension to a MI
method requires further research.

Appendix

A Simulation design: analysis models and sample characteristics


Data set  Number of    Number of  Sample  Logistic regression model                                      Number of
          individuals  variables  size                                                                   quantities
                                                                                                         of interest
--------  -----------  ---------  ------  -------------------------------------------------------------  -----------
Saheart   462          10         300     CHD = FAMHIST + TOBACCO + ALCOHOL                              30
Galetas   1192         4          300     GALLE = GRUPO                                                  6
Sbp       500          18         200     SBP = SMOKE + EXERCISE + ALCOHOL                               12
Income    6876         14         1500    INCOME = SEX                                                   8
Titanic   2201         4          300     SURV = CLASS + AGE + SEX                                       6
Credit    982          20         300     CLASS = CHECKING_STATUS + DURATION + CREDIT_HISTORY + PURPOSE  11


Table 2: Set of the sample characteristics and of the analysis models used to perform the simulation
study (Section 4.2) for the several data sets (Saheart, Galetas, Sbp, Income, Titanic,
Credit).


B Simulation study: complementary results

Figure 4: Distribution of the relative bias (bias divided by the true value) over the several
quantities of interest for several methods (Listwise deletion, Sample, Loglinear model,
DPMPM, Normal distribution, MIMCA, FCS using logistic regressions, FCS using random
forests, Full data) for different data sets (Saheart, Galetas, Sbp, Income, Titanic,
Credit). One point represents the relative bias observed for one coefficient.

Figure 5: Distribution of the median of the confidence interval for the several quantities of interest
for several methods (Sample, Loglinear model, DPMPM, Normal distribution, MIMCA,
FCS using logistic regressions, FCS using random forests, Full data) for different data
sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the median
of the confidence interval observed for one coefficient divided by the one obtained by
listwise deletion. The horizontal dashed line corresponds to a ratio of 1. Points over this
line correspond to confidence intervals larger than the one obtained by listwise deletion.

Figure 6: Distribution of the median of the confidence interval for the several quantities of interest
for the MIMCA algorithm for several numbers of dimensions for different data sets
(Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the median
of the confidence interval observed for one coefficient divided by the one obtained by
listwise deletion. The horizontal dashed line corresponds to a ratio of 1. Points over this
line correspond to confidence intervals larger than the one obtained by listwise deletion.
The results for the number of dimensions provided by cross-validation are in grey.

Figure 7: Distribution of the relative bias (bias divided by the true value) over the several
quantities of interest for the MIMCA algorithm for several numbers of dimensions for different
data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the
relative bias observed for one coefficient. The results for the number of dimensions
provided by cross-validation are in grey.